Locality-aware chunk distribution #3102

Open · SirYwell wants to merge 11 commits into main from perf/work-distribution

Conversation

@SirYwell SirYwell (Member) commented Feb 7, 2025

Overview

Description

ParallelQueueExtent deals with distributing work onto multiple threads to speed up edits. In its current implementation, this is achieved by simply using the Region#getChunks().iterator() as a queue and letting all threads pull the next chunk to work on from there. This typically works well, but it has one downside: different threads work on neighboring chunks. There are multiple reasons this isn't optimal (a minimal sketch of this pull model follows the list below):

  • Filters accessing data of surrounding chunks will cause more cache misses in their STQE. Because one STQE can currently trim a chunk that is cached/processed by another STQE, we do a lot of extra work in such cases.
  • Due to how Minecraft generates chunks, processing neighboring chunks on different threads can be slower if both threads wait for the same resources.
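
To make the contrast concrete, here is a minimal sketch of the current pull model (illustrative names only, not FAWE's actual code): every worker polls the next chunk from one shared iterator, so neighboring chunks routinely land on different threads.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the shared-iterator pull model. ChunkPos and process()
// are hypothetical stand-ins for the real chunk handling.
final class SharedIteratorSketch {
    record ChunkPos(int x, int z) {}

    static void applyAll(Iterator<ChunkPos> chunks, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (true) {
                    ChunkPos next;
                    synchronized (chunks) { // the iterator itself acts as the queue
                        if (!chunks.hasNext()) {
                            return;
                        }
                        next = chunks.next();
                    }
                    process(next); // neighboring chunks often end up on different threads
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void process(ChunkPos pos) { /* apply the edit to one chunk */ }

    public static void main(String[] args) throws InterruptedException {
        applyAll(List.of(new ChunkPos(0, 0), new ChunkPos(0, 1), new ChunkPos(1, 0)).iterator(), 2);
    }
}
```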

By distributing work in a more location-aware manner, these downsides can be avoided, at the cost of somewhat more complicated work-splitting logic. This means threads will process more tasks, but those tasks are smaller and faster to process. As an additional benefit, this allows better work distribution when multiple edits run concurrently, as threads don't have to finish one edit before they are free again for the next one. Consequently, thread-local data needs to be cleared and re-set more often.
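
A minimal sketch of what locality-aware splitting can look like with fork/join (class and field names are illustrative, not the PR's actual ApplyTask): a square of chunks is recursively quartered, so each leaf task covers a compact area and neighboring chunks tend to stay on the same thread.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Each task covers a (1 << shift) x (1 << shift) square of chunks and
// splits itself into four quadrants until a single chunk remains.
final class LocalitySplitSketch extends RecursiveAction {
    private final int chunkX, chunkZ, shift;

    LocalitySplitSketch(int chunkX, int chunkZ, int shift) {
        this.chunkX = chunkX;
        this.chunkZ = chunkZ;
        this.shift = shift;
    }

    @Override
    protected void compute() {
        if (shift == 0) { // single chunk: do the actual work
            processChunk(chunkX, chunkZ);
            return;
        }
        int half = 1 << (shift - 1); // quarter the square and fork the pieces
        invokeAll(
            new LocalitySplitSketch(chunkX,        chunkZ,        shift - 1),
            new LocalitySplitSketch(chunkX + half, chunkZ,        shift - 1),
            new LocalitySplitSketch(chunkX,        chunkZ + half, shift - 1),
            new LocalitySplitSketch(chunkX + half, chunkZ + half, shift - 1)
        );
    }

    private static void processChunk(int x, int z) { /* per-chunk edit */ }

    public static void main(String[] args) {
        // process a 32x32-chunk (512x512-block) square
        ForkJoinPool.commonPool().invoke(new LocalitySplitSketch(0, 0, 5));
    }
}
```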

Exception handling is a bit of a question mark with this approach. I tried to keep it close to the original handling, but I'm not sure it makes sense the way it is right now.



github-actions bot commented Feb 9, 2025

Please take a moment and address the merge conflicts of your pull request. Thanks!

@SirYwell SirYwell force-pushed the perf/work-distribution branch from 130ef17 to 0f64189 on February 12, 2025 16:04
@SirYwell SirYwell requested a review from a team February 12, 2025 17:43
@SirYwell SirYwell marked this pull request as ready for review February 12, 2025 17:43
@dordsor21 dordsor21 (Member) left a comment:


I can see a scenario where we end up with one ApplyTask instance effectively only utilising one thread for a 512x512 region of the edit, causing the whole edit to wait for it when everything else is finished. I don't know if there's a particular way around that, but in a complex operation this could be quite a perceived performance hit

@SirYwell SirYwell (Member, Author) commented:

> I can see a scenario where we end up with one ApplyTask instance effectively only utilising one thread for a 512x512 region of the edit, causing the whole edit to wait for it when everything else is finished. I don't know if there's a particular way around that, but in a complex operation this could be quite a perceived performance hit

Yes, that's possible in theory, and the heuristic when to split is rather simple currently. In practice, it would mean the thread pool is already busy and there are even more tasks waiting. We could e.g. force splitting on 512x512 regions, or check ForkJoinTask.getSurplusQueuedTaskCount() > Settings.settings().QUEUE.PARALLEL_THREADS * f(this.shift) (where f is some "damping" function) to make that case even less likely.

I also thought about re-checking the heuristics during processing, but I think that complicates things too much.
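
For illustration, such a surplus-based gate could look roughly like this; the threshold, the linear f, and the class layout are assumptions, not the PR's code:

```java
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;

// Hypothetical sketch of the surplus-based split gate: while few of this
// thread's queued tasks remain un-stolen, keep forking; once the surplus
// exceeds a shift-dependent threshold, process the whole square directly.
final class SplitGateSketch extends RecursiveAction {
    private static final int PARALLEL_THREADS =
            Runtime.getRuntime().availableProcessors(); // stand-in for QUEUE.PARALLEL_THREADS

    private final int shift; // log2 of the square's side length in chunks

    SplitGateSketch(int shift) {
        this.shift = shift;
    }

    // some "damping" function f(shift); linear is just a placeholder choice
    private static int f(int shift) {
        return Math.max(1, shift);
    }

    @Override
    protected void compute() {
        if (shift == 0
                || ForkJoinTask.getSurplusQueuedTaskCount() > PARALLEL_THREADS * f(shift)) {
            processDirectly(); // plenty of pending work elsewhere: stay on this thread
        } else {
            splitAndFork();    // pool looks hungry: quarter the square and fork
        }
    }

    private void processDirectly() { /* sequential processing of the square */ }

    private void splitAndFork() { /* fork four quadrant tasks, as sketched above */ }
}
```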

@dordsor21 dordsor21 (Member) commented:

> > I can see a scenario where we end up with one ApplyTask instance effectively only utilising one thread for a 512x512 region of the edit, causing the whole edit to wait for it when everything else is finished. I don't know if there's a particular way around that, but in a complex operation this could be quite a perceived performance hit
>
> Yes, that's possible in theory, and the heuristic when to split is rather simple currently. In practice, it would mean the thread pool is already busy and there are even more tasks waiting. We could e.g. force splitting on 512x512 regions, or check ForkJoinTask.getSurplusQueuedTaskCount() > Settings.settings().QUEUE.PARALLEL_THREADS * f(this.shift) (where f is some "damping" function) to make that case even less likely.
>
> I also thought about re-checking the heuristics during processing, but I think that complicates things too much.

If we are processing a larger region one smaller subregion at a time in the first place, rather than the for x; for z loops, I think it would be much more reasonable to check whether we want to parallelise further at the completion of each subregion, and submit the rest accordingly. E.g. create a queue at the start of a region's processing and just submit the remaining subregions as/when needed.
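
A rough sketch of that suggestion, with hypothetical names (FAWE's real handling will differ): the owning task drains a local queue of subregions and only hands the remainder to the pool when it notices spare capacity.

```java
import java.util.Queue;
import java.util.concurrent.ForkJoinPool;

// Hypothetical queue-per-region sketch: process subregions from a local
// queue, and after each one check whether the pool has spare capacity;
// if so, submit the remaining subregions instead of looping on.
final class LazySubmitSketch {
    record SubRegion(int chunkX, int chunkZ) {}

    static void processRegion(Queue<SubRegion> pending, ForkJoinPool pool) {
        SubRegion next;
        while ((next = pending.poll()) != null) {
            process(next);
            // re-evaluate the parallelise-further decision after each subregion
            if (!pool.hasQueuedSubmissions()
                    && pool.getActiveThreadCount() < pool.getParallelism()) {
                SubRegion rest;
                while ((rest = pending.poll()) != null) {
                    SubRegion r = rest; // effectively final copy for the lambda
                    pool.submit(() -> process(r));
                }
                return;
            }
        }
    }

    static void process(SubRegion r) { /* apply the edit to one subregion */ }
}
```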

@SirYwell
Copy link
Member Author

> If we are processing a larger region one smaller subregion at a time in the first place, rather than the for x; for z loops, I think it would be much more reasonable to check whether we want to parallelise further at the completion of each subregion, and submit the rest accordingly. E.g. create a queue at the start of a region's processing and just submit the remaining subregions as/when needed.

Hm, I guess processing chunks along a Hilbert curve might be useful there. Otherwise, just always splitting and then unforking is basically that, and the overhead of eagerly creating that many tasks is most likely not a problem.
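
For reference, the standard Hilbert-curve index-to-coordinate conversion (adapted from the well-known pseudocode; the chunk-grid framing is mine): consecutive indices map to spatially adjacent cells, which is exactly the locality property wanted here.

```java
// Converts an index d along a Hilbert curve over an n x n grid (n a power
// of two) into (x, z) coordinates. Walking d = 0..n*n-1 visits the grid so
// that consecutive chunks are always neighbors.
final class HilbertSketch {
    static int[] d2xy(int n, int d) {
        int x = 0, z = 0, t = d;
        for (int s = 1; s < n; s *= 2) {
            int rx = 1 & (t / 2);
            int rz = 1 & (t ^ rx);
            if (rz == 0) { // rotate/flip the quadrant if needed
                if (rx == 1) {
                    x = s - 1 - x;
                    z = s - 1 - z;
                }
                int tmp = x;
                x = z;
                z = tmp;
            }
            x += s * rx;
            z += s * rz;
            t /= 4;
        }
        return new int[]{x, z};
    }

    public static void main(String[] args) {
        // walk a 4x4 chunk grid in Hilbert order
        for (int d = 0; d < 16; d++) {
            int[] p = d2xy(4, d);
            System.out.printf("d=%2d -> chunk (%d, %d)%n", d, p[0], p[1]);
        }
    }
}
```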

@SirYwell SirYwell (Member, Author) commented Feb 20, 2025

I reworked the heuristic that decides when to process a region directly a bit:

  • With larger shifts, we fork more aggressively (e.g. on the highest level, where we split into 512x512 (blocks) regions, we need to split off 32 tasks (32 region files) that aren't yet taken by other threads before processing a full region directly)
  • Even 32x32 (blocks) regions are split further unless there are more than 3 tasks in the queue of the current thread

With this design, I think it is very unlikely that we process overly large regions that take more time than the pool can currently handle. We can explore more dynamic approaches in the future, but I'd like to start with a deliberately simple solution.
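
Read as code, that heuristic could look roughly like the following; only the shift-5 and shift-1 thresholds are stated above, the in-between values are my guess:

```java
import java.util.concurrent.ForkJoinTask;

// Hypothetical rendering of the reworked heuristic. shift is the log2 side
// length of the square in chunks: shift 5 = 32x32 chunks (512x512 blocks),
// shift 1 = 2x2 chunks (32x32 blocks).
final class ReworkedGateSketch {
    static int surplusThreshold(int shift) {
        return switch (shift) {
            case 5 -> 32;          // full region: 32 un-stolen tasks before processing directly
            case 1 -> 3;           // keep splitting 2x2-chunk squares unless > 3 tasks are queued
            default -> 1 << shift; // in-between levels: assumed interpolation
        };
    }

    static boolean processDirectly(int shift) {
        return shift == 0
                || ForkJoinTask.getSurplusQueuedTaskCount() > surplusThreshold(shift);
    }
}
```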

@dordsor21 dordsor21 (Member) commented Feb 20, 2025

> I reworked the heuristic that decides when to process a region directly a bit:
>
> • With larger shifts, we fork more aggressively (e.g. on the highest level, where we split into 512x512 (blocks) regions, we need to split off 32 tasks (32 region files) that aren't yet taken by other threads before processing a full region directly)
> • Even 32x32 (blocks) regions are split further unless there are more than 3 tasks in the queue of the current thread
>
> With this design, I think it is very unlikely that we process overly large regions that take more time than the pool can currently handle. We can explore more dynamic approaches in the future, but I'd like to start with a deliberately simple solution.

I suppose it's hard to make much judgement without some kind of usage statistics from servers with a lot of players on. I don't know if we would want to effectively enforce single-threaded operation per edit. For now it's probably fine, but worth seeing if we can keep an eye on this in bstats somehow?


@SirYwell SirYwell (Member, Author) commented:

> I suppose it's hard to make much judgement without some kind of usage statistics from servers with a lot of players on. I don't know if we would want to effectively enforce single-threaded operation per edit. For now it's probably fine, but worth seeing if we can keep an eye on this in bstats somehow?

It's generally difficult for me to tell how people are using FAWE (which patterns, masks, brushes, typical edit sizes, etc.), so I'd like to explore some kind of observability that can be shared (probably similar to /fawe debugpaste). But the scenario where an edit becomes single-threaded while multiple cores are available is extremely unlikely (previous long-running tasks would need to take up all but one thread and then all end at basically the same time). I'll try to work on something more dynamic in the future, but that will most likely increase complexity and might take me some weeks.

@dordsor21 dordsor21 (Member) commented:

Yeah, I think the groundwork here is probably worth a merge for now. As it is, I don't know that large edits would be done very often, and I do also wonder if perhaps for small edits (<ncpu chunks) we should just run them single-threaded anyway.
